Create a term context from the BIOfid corpus

The BIOfid portal is aware of the semantic context of a term. Here, semantic context means that the portal "knows" how often a term is mentioned in the same documents as other terms.

For example, you would expect the term "Fagus" (i.e. beeches) to occur very often in documents that mention the term "plants". This concept can be extended: when you search for "Fagus" (or the BIOfid-URI for "Fagus"), you retrieve not only the BIOfid-URI (and its label) for "plants", but also other taxa (both plants and animals) that are often mentioned together with "Fagus" in the texts.
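
The idea of document-level co-occurrence can be sketched in a few lines. This is only an illustration on an invented toy corpus, not the portal's actual implementation; the function name `term_context` and the sample documents are assumptions made for this example.

```python
from collections import Counter

# Invented toy corpus: each document is reduced to the set of terms annotated in it.
documents = [
    {"Fagus", "plants", "Quercus"},
    {"Fagus", "plants"},
    {"Quercus", "plants"},
    {"Fagus", "Sciurus vulgaris"},
]

def term_context(term: str, docs: list) -> Counter:
    """Count in how many documents each other term appears together with `term`."""
    counts = Counter()
    for doc_terms in docs:
        if term in doc_terms:
            # Every other term in this document co-occurs once with `term`
            counts.update(doc_terms - {term})
    return counts

context = term_context("Fagus", documents)
print(context.most_common())
```

In this toy corpus, "plants" shares two documents with "Fagus", while the other terms share one each.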

The BIOfid API getTermContext allows you to retrieve the most common URIs that are associated with a given term. For example, to get all associated terms in the BIOfid corpus for "Fagus" (which has the BIOfid-URI https://www.biofid.de/bio-ontologies/Tracheophyta/gbif/2874875), you can put this URL into a script or your browser's address bar:

https://www.biofid.de/api/v1/getTermContext?term=https://www.biofid.de/bio-ontologies/Tracheophyta/gbif/2874875
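
When calling the API from a script, it is usually safer to percent-encode the term, since the term is itself a URI. The following is a minimal sketch using only the Python standard library. The helper name `build_term_context_url` is made up for this example, and it is an assumption that the endpoint accepts an encoded `term` parameter and answers with JSON.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = "https://www.biofid.de/api/v1/getTermContext"

def build_term_context_url(term: str) -> str:
    """Return the getTermContext URL for a term, with the term percent-encoded."""
    return f"{API_ENDPOINT}?{urlencode({'term': term})}"

fagus_uri = "https://www.biofid.de/bio-ontologies/Tracheophyta/gbif/2874875"
url = build_term_context_url(fagus_uri)
print(url)

# Fetching is left commented out so the sketch runs offline.
# Assumption: the endpoint answers with JSON.
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     term_data = json.load(response)
```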

This returns a large amount of data. Here is an example of what a single record can look like:

...
{
    "uri": "https://www.biofid.de/bio-ontologies/Tracheophyta/gbif/2874875",
    "count": 1244,  // The number of articles where the given term and the "uri" are mentioned together
    "type": "Taxon",  // The type of the given "uri", currently either "Taxon" or "Location"
    "documents":[0,1,2,3,4,5,6,...., 1243],  // A reference to the document URL index in the same dataset.
    "label": "Fagus L.",  // The label for the given URI, if available
    "sameAs": ["https://www.gbif.org/species/2874875"]  // SameAs-relationship to other infrastructures.
}
...
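
Records of this shape can be ranked by their `count` to find the most strongly associated terms. The sketch below uses invented records that only mimic the field layout shown above. It assumes the records sit in a `results` list, as in the script further down, and the helper `top_terms` is hypothetical.

```python
# Invented records that only mimic the field layout shown above.
sample_response = {
    "results": [
        {"uri": "https://example.org/taxon/a", "count": 12, "type": "Taxon", "label": "Taxon A"},
        {"uri": "https://example.org/location/x", "count": 7, "type": "Location", "label": "Location X"},
        {"uri": "https://example.org/taxon/b", "count": 30, "type": "Taxon", "label": "Taxon B"},
        {"uri": "https://example.org/taxon/c", "count": 4, "type": "Taxon"},  # no label available
    ]
}

def top_terms(response: dict, term_type: str, limit: int = 10) -> list:
    """Return (label-or-uri, count) pairs of one type, most frequent first."""
    records = [r for r in response["results"] if r["type"] == term_type]
    records.sort(key=lambda r: r["count"], reverse=True)
    # Fall back to the URI when a record has no label
    return [(r.get("label") or r["uri"], r["count"]) for r in records[:limit]]

print(top_terms(sample_response, "Taxon", limit=2))
# → [('Taxon B', 30), ('Taxon A', 12)]
```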

Below you find an example of how this data can be evaluated, using the moth family Zygaenidae. You should be able to replace TERM_STRING with a BIOfid-URI, a Wikidata-URI, or a literal term and run the script again.

For the figure, the script filters for species explicitly (so, no genus or family data is considered).

This script was used to generate a figure for a TDWG presentation:

Pachzelt A, Kasperek G, Lücking A, Abrami G, Driller C (2021) Semantic Search in Legacy Biodiversity Literature: Integrating data from different data infrastructures. Biodiversity Information Science and Standards 5: e74251. https://doi.org/10.3897/biss.5.74251

In [1]:
from scripts.commons import Biofid
import json
from copy import copy

TERM_STRING = 'https://www.biofid.de/bio-ontologies/Lepidoptera/gbif/8875'  # Zygaenidae

biofid = Biofid()
term_data = biofid.get_term_context(TERM_STRING)

show_data = copy(term_data['results'][0])
show_data.pop('documents')  # This is a large numeric list
print(json.dumps(show_data, indent=4))
{
    "uri": "https://www.biofid.de/bio-ontologies/Lepidoptera/gbif/8875",
    "count": 27,
    "type": "Taxon",
    "label": "Zygaenidae",
    "sameAs": [
        "https://www.gbif.org/species/8875"
    ]
}
In [2]:
import pandas as pd

def is_species(uri: str) -> bool:
    try:
        taxon_data = biofid.get_biofid_data_for_uri(uri)
        return any(
            row['object']['value'] == 'https://www.biofid.de/bio-ontologies#Rank_Species'
            for row in taxon_data['data'])
    except (IndexError, ConnectionError):
        return False

associated_tracheophyta = []
for term in term_data['results']:
    # Keep only vascular plant taxa at species rank
    if term['type'] == 'Taxon' and 'Tracheophyta' in term['uri'] and is_species(term['uri']):
        associated_tracheophyta.append((term['uri'], term['label'], term['count']))

# Build the DataFrame once, after all matching terms have been collected
df = pd.DataFrame(associated_tracheophyta, columns=['uri', 'label', 'count'])

print(df.head())
In [3]:
import plotly.graph_objects as go

def filter_data_by_article_count(dataset: list, min_article_count: int) -> list:
    return list(filter(lambda x: x[2] >= min_article_count, dataset))

def generate_range_data(dataset, label=None, start: int = 0) -> list:
    # One entry per dataset item: either the fixed label or a running index
    return [label if label is not None else i for i in range(start, start + len(dataset))]

MIN_TAXON_ARTICLE_COUNT = 6
MIN_LOCATION_ARTICLE_COUNT = 3

associated_locations = []
for term in term_data['results']:
    if term['type'] == 'Location' and term['label']:
        associated_locations.append((term['uri'], term['label'], term['count']))

associated_tracheophyta_with_min_count = filter_data_by_article_count(
    associated_tracheophyta, min_article_count=MIN_TAXON_ARTICLE_COUNT)
associated_locations_with_min_count = filter_data_by_article_count(
    associated_locations, min_article_count=MIN_LOCATION_ARTICLE_COUNT)

# Node indices: 0 = search term, 1 = 'Tracheophyta' group, 2 = 'Locations' group
taxon_group_target = generate_range_data(
    associated_tracheophyta_with_min_count, label=1)
location_group_target = generate_range_data(
    associated_locations_with_min_count, label=2)

# Include the original term as source (Index 0)
merged_sources = [0, 0]
merged_sources.extend(taxon_group_target)
merged_sources.extend(location_group_target)

# Taxon nodes start at index 3; location nodes follow directly after them
merged_targets = [1, 2]
start_count_taxa = 3
merged_targets.extend(
    range(start_count_taxa, start_count_taxa + len(taxon_group_target))
)
start_count_locations = start_count_taxa + len(taxon_group_target)
merged_targets.extend(
    range(start_count_locations, start_count_locations + len(location_group_target))
)

labels = [
    term_data['results'][0]['label'],
    'Tracheophyta',
    'Locations'
]

labels.extend(term[1] for term in associated_tracheophyta_with_min_count)
labels.extend(term[1] for term in associated_locations_with_min_count)

taxon_values = [term[2] for term in associated_tracheophyta_with_min_count]
location_values = [term[2] for term in associated_locations_with_min_count]

merged_values = [
    sum(taxon_values) + 5,
    sum(location_values) - 5
]
merged_values.extend(taxon_values)
merged_values.extend(location_values)

x_values = [None for i in range(0, len(labels) + 1)]
y_values = [None for i in range(0, len(labels) + 1)]
y_values[2] = 0.6

fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        label=labels,
        x=[0, 0, 0.5],
        y=[0, 0, 0.8]
    ),
    link=dict(
        source=merged_sources,
        target=merged_targets,
        value=merged_values
    ))])

fig.write_image('term-associations.png', scale=3)
fig.show()
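
The index bookkeeping for the Sankey diagram is easy to get wrong, because every link refers to its nodes by position in the label list. The following sketch shows one way to derive labels, sources, targets, and values together so that they stay aligned. It is an illustration with invented counts, not the code used for the published figure; the helper `build_sankey_links` is hypothetical, and group values here are plain sums of their members.

```python
def build_sankey_links(root_label: str, groups: dict) -> dict:
    """Build labels, sources, targets and values for a two-level Sankey diagram.

    `groups` maps a group label to a list of (label, count) pairs.
    """
    labels = [root_label] + list(groups)
    sources, targets, values = [], [], []

    # First level: root node (index 0) -> one node per group
    for group_index, members in enumerate(groups.values(), start=1):
        sources.append(0)
        targets.append(group_index)
        values.append(sum(count for _, count in members))

    # Second level: group node -> one node per member, indexed after the groups
    for group_index, members in enumerate(groups.values(), start=1):
        for label, count in members:
            sources.append(group_index)
            targets.append(len(labels))  # index the new node will get
            labels.append(label)
            values.append(count)

    return {"labels": labels, "sources": sources, "targets": targets, "values": values}

# Invented counts, for illustration only
links = build_sankey_links("Term", {
    "Taxa": [("Taxon A", 8), ("Taxon B", 6)],
    "Locations": [("Location X", 3)],
})
print(links["sources"])  # → [0, 0, 1, 1, 2]
print(links["targets"])  # → [1, 2, 3, 4, 5]
```

The resulting lists can be passed to `go.Sankey` as `node=dict(label=...)` and `link=dict(source=..., target=..., value=...)`.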

Conclusion

The generated figure illustrates that the Zygaenidae are associated in the BIOfid corpus with indicator species for calcareous grassland. Since Zygaenidae prefer this ecosystem, this is not surprising. However, the BIOfid portal established this relation solely by indexing documents, without any machine learning involved.